Multi-View Attention Network for Visual Dialog

Authors

Abstract

Visual dialog is a challenging vision-language task in which a series of questions visually grounded by a given image are answered. To resolve the visual dialog task, a high-level understanding of various multimodal inputs (e.g., question, dialog history, and image) is required. Specifically, it is necessary for an agent to (1) determine the semantic intent of the question and (2) align question-relevant textual and visual contents among heterogeneous modality inputs. In this paper, we propose the Multi-View Attention Network (MVAN), which leverages multiple views about heterogeneous inputs based on attention mechanisms. MVAN effectively captures question-relevant information from the dialog history with two complementary modules (i.e., Topic Aggregation and Context Matching), and builds multimodal representations through sequential alignment processes (i.e., Modality Alignment). Experimental results on the VisDial v1.0 dataset show the effectiveness of our proposed model, which outperforms previous state-of-the-art methods under both single model and ensemble model settings.
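As a rough illustration of the kind of attention mechanism the abstract refers to, the sketch below lets a question vector attend over dialog-history vectors using generic scaled dot-product attention. This is not MVAN's Topic Aggregation, Context Matching, or Modality Alignment module; the function names (`softmax`, `attend`) and the toy two-dimensional embeddings are assumptions made purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, items):
    """Scaled dot-product attention of one query vector over item vectors.

    Returns the attention weights and the weighted sum of the items.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, item)) / scale for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [
        sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)
    ]
    return weights, context


# Hypothetical, hand-made embeddings: one question and three history rounds.
question = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, context = attend(question, history)
print(weights)  # the round most similar to the question gets the largest weight
print(context)  # question-relevant summary of the history
```

In this toy setup the first history round points in the same direction as the question, so it receives the largest weight, and the resulting context vector is a question-conditioned summary of the history.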


Similar resources

Visual Reference Resolution using Attention Memory for Visual Dialog

Visual dialog is a task of answering a series of inter-dependent questions given an input image, and often requires to resolve visual references among the questions. This problem is different from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits visual atte...


Dual Attention Network for Visual Question Answering

Visual Question Answering (VQA) is a popular research problem that involves inferring answers to natural language questions about a given visual scene. Recent neural network approaches to VQA use attention to select relevant image features based on the question. In this paper, we propose a novel Dual Attention Network (DAN) that not only attends to image features, but also to question features....


Visual Specification of Multi-View Visual Environments

We describe a set of visual tools for specifying and generating multi-view visual environments. JComposer provides an architecture description language for defining environment repositories, view models, and view-repository mappings. A visual event-flow language permits annotation of JComposer diagrams with event handlers specifying environment semantics. BuildByWire supports constraint-based v...


Multi-level Gated Recurrent Neural Network for dialog act classification

In this paper we focus on the problem of dialog act (DA) labelling. This problem has recently attracted a lot of attention as it is an important sub-part of an automatic dialog model, which is currently in great demand. Traditional methods tend to see this problem as a sequence labelling task and deal with it by applying classifiers with rich features. Most of the current neural network models ...


Fast and adaptive network of spiking neurons for multi-view visual pattern recognition

In this paper, we describe and evaluate a new spiking neural network (SNN) architecture and its corresponding learning procedure to perform fast and adaptive multi-view visual pattern recognition. The network is composed of a simplified type of integrate-and-fire neurons arranged hierarchically in four layers of two-dimensional neuronal maps. Using a Hebbian-based training, the network adaptive...



Journal

Journal title: Applied Sciences

Year: 2021

ISSN: 2076-3417

DOI: https://doi.org/10.3390/app11073009